Add pread/pwrite syscalls and fix strace retry logic#46
Add pread/pwrite syscalls and fix strace retry logic#46
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #46 +/- ##
==========================================
- Coverage 45.49% 44.61% -0.89%
==========================================
Files 4 4
Lines 455 464 +9
==========================================
Hits 207 207
- Misses 248 257 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
|
@copilot Without writing any new code, can you suggest a few approaches to making this package basically work for mmaped files on linux, understanding that syscall tracing is not the way here. |
Here are several approaches to support mmap'd files on Linux without syscall tracing: 1. eBPF-based tracing (most accurate)
2. /proc/self/smaps tracking (estimation-based)
3. perf with page-fault events (system tool)
4. Hybrid approach (fallback strategy)
Recommendation: Start with approach #4 (hybrid) as it requires minimal changes. Add a warning when captured I/O is suspiciously low relative to file size. For full mmap support, approach #2 (/proc/smaps) is most practical without new dependencies. |
|
@copilot A couple of questions:
Please just answer in a comment, don't write any code. |
1) /proc/self/smaps detailed approach:
Before execution:
After execution:
Key fields in smaps: Limitation: RSS includes prefetched pages and doesn't distinguish between first access vs. cached access, so it's an upper bound estimate. 2) Python packages for perf events: Yes, a)
b)
However: Both still need privileges (CAP_PERFMON or relaxed |
Change Description
Issue reports PyArrow parquet reads capture only 3.63 MB for a 1 GB file. Investigation shows current implementation correctly captures 100% of I/O for files ≤407 MB. PyArrow likely uses mmap for files >1 GB, which bypasses syscall tracing (page faults vs read syscalls). This is a fundamental limitation of strace-based measurement.
This PR adds 32-bit syscall support (pread/pwrite) which were missing, though this does not address the mmap limitation.
Solution Description
Syscall coverage:
Code quality:
_start_strace()helper to eliminate duplication in retry pathTesting:
Limitations identified:
strace cannot capture mmap-based I/O (data accessed via page faults, not syscalls). If PyArrow uses mmap for large files (>1 GB threshold), those reads will not appear in syscall traces. Alternative: eBPF/perf with page fault tracking.
Code Quality
Project-Specific Pull Request Checklists
Bug Fix Checklist
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
huggingface.co/usr/bin/curl curl -L REDACTED -o train-00000-of-00007.parquet --max-time 120 DROP(dns block)If you need me to access, download, or install something from one of these locations, you can either:
Original prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.